This is not really a proper bug report, but I thought I should post this here
in case someone can find any sane, non-supernatural reason for a strange case
of data loss I have experienced with git-annex.

Some time ago I cloned a bunch of git-annex repos from an external drive (let's
call it disk1) to a new computer (computer3). On one of my repos git-annex
marked a bunch of files corrupt and moved them to .git/annex/bad. Oops, I
thought, I must have a failing disk. Luckily I had offsite backups -- no less
than two other external hard disks (disk2-3), each having a full copy of the
repo in question. However, **both of these** had the same, corrupt files. The
files have the correct size, but are filled with zeroes. Other files in the
repo are fine, and so are other repos.
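
For anyone wanting to check their own disks for the same symptom (right size, all zeroes), here is a minimal Python sketch. It is not part of git-annex; the function names `is_zero_filled` and `find_zero_filled` are made up for illustration, and it simply walks a directory and reports non-empty files that contain nothing but zero bytes.

```python
import os


def is_zero_filled(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte in it is zero."""
    seen_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            seen_data = True
            # bytes.count(0) counts zero bytes; any other byte means real data.
            if chunk.count(0) != len(chunk):
                return False
    return seen_data


def find_zero_filled(root):
    """Yield paths of regular files under root that contain only zero bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # isfile() is False for broken symlinks, so those are skipped.
            if os.path.isfile(path) and is_zero_filled(path):
                yield path
```

Pointing `find_zero_filled` at a directory such as `.git/annex/bad` lists the suspicious files. Empty files are deliberately not reported, since a zero-length file is not the symptom described here.
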

I have been trying to wrap my head around this, but I can't think of any way
this could occur. However the files got corrupted in the first place, the
corruption should have been picked up when copying the content to
the external drives disk2 and disk3, right? I have to rule out NSA/MIB/aliens
messing with me because these files are not that valuable or sensitive.

The files in question were added to git-annex back in 2012, so the trail is
cold on this one. Naturally, I have no idea how to reproduce this, nor can I
reliably say that git-annex is to blame. I can gather some hints though. The
files were all added in the same commit in 2012, but not all files from that
commit are corrupted. The corrupted files have consecutive file names. The
files were never modified since (except for the corruption), and the content
*may* have been copied via an encrypted rsync transfer repository. I have
always used git-annex on Arch Linux and in indirect mode. The files used the

All these files have a similar tracking log that looks something like this
(uuids replaced with symbolic names):

    1356690700.542152s 1 computer1 <- first added
    1356691074.253815s 1 disk1 <- copied to disk1
    1356719321.145126s 1 rsync <- copied to rsync repo
    1358070999.435676s 1 rsync <- copied to rsync repo (again?)
    1362166895.310332s 1 disk2 <- copied to disk2
    1362906850.555869s 1 computer2 (dead) <- copied to another computer
    1364926664.362195s 0 computer1 <- dropped from computer1 as enough copies in disks
    1374412057.409496s 0 computer2 (dead) <- dropped from computer2, now dead
    1445691595.764108s 1 disk3 <- copied to disk3
    1445770764.165792s 0 rsync <- dropped from rsync repo to save space
    1482077052.217353646s 0 disk1 <- first noticed as corrupted on disk1
    1482741278.318274404s 0 disk3 <- WTF, also corrupted on disk3
    1482926246.268440532s 0 disk2 <- double-WTF, also corrupted on disk2
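
To make a log like this easier to read, here is a small Python sketch that reduces it to the latest known state per repository. It assumes only the three-field layout shown above (a timestamp ending in `s`, a `1`/`0` presence flag, then the repository name) and strips the `<-` annotations, which were added by hand for this post. The helper name `parse_location_log` is hypothetical and this is not git-annex's own parser.

```python
def parse_location_log(text):
    """Return {repo: (timestamp, present)} keeping the newest entry per repo."""
    latest = {}
    for line in text.splitlines():
        # Drop the hand-written "<- ..." annotations used in the post.
        line = line.split("<-", 1)[0].strip()
        if not line:
            continue
        # Split into at most three fields so names like "computer2 (dead)" survive.
        stamp, status, repo = line.split(None, 2)
        ts = float(stamp.rstrip("s"))
        if repo not in latest or ts >= latest[repo][0]:
            latest[repo] = (ts, status == "1")
    return latest
```

Listing the repositories whose newest entry has the flag `1` then shows where the content is still believed to exist. Note that a `1` only records that the content was stored there at some point; as this report shows, it says nothing about whether the bytes were intact.
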

The only thing that strikes me as odd is the double entry with the rsync
remote. The non-corrupted files from the same commit do not seem to have such a

So my main question is, has there ever been a bug in git-annex that could have
caused this behavior? Or is there any other realistic explanation for this? In
case this is an existing bug, is there any other evidence I can gather?
Needless to say, the lesson here is to run `git annex fsck` regularly even if
you have offsite backups...
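
As a rough illustration of the kind of check that catches this, here is a Python sketch that verifies a file against the size and digest encoded in its name. It assumes a key name of the form `SHA256-s<size>--<hexdigest>` or `SHA256E-s<size>--<hexdigest>.<ext>`; whether the affected files used such a key is an assumption, and the helper `verify_object` is hypothetical. This is not how `git annex fsck` is implemented, only a sketch of the idea.

```python
import hashlib
import os
import re

# Assumed key layout: SHA256[E]-s<size>--<64 hex digits>[.ext]
KEY_RE = re.compile(r"^SHA256E?-s(?P<size>\d+)--(?P<digest>[0-9a-f]{64})")


def verify_object(path):
    """Check a file against the size and SHA-256 digest encoded in its name."""
    match = KEY_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("not a SHA256/SHA256E-style key name: %s" % path)
    # A size mismatch is cheap to detect, so test it first.
    if os.path.getsize(path) != int(match.group("size")):
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == match.group("digest")
```

A zero-filled file of the correct length passes the size test but fails the digest comparison, which is exactly the case described above and the reason a size check alone would not have caught it.
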

> My diagnosis is that the files got corrupted before they were copied to disk2
> and disk3. Which repository they reached those disks via does not matter much.
> And indeed, 5 year old git-annex didn't verify the content of
> files it transferred to/from a remote. Current git-annex does, so I guess
> this is [[done]]. --[[Joey]]